White Wine Analysis by Craig Nicholson

What factors or features make a good-tasting white wine?

Correlation of the features

Moderately to Strongly Correlated Features

These are the features we want to avoid adding together to the linear model, because they are correlated with one another. Each line lists a feature and the features it correlates with; where a correlation coefficient was recorded, it follows the feature name in parentheses.
- fixed.acidity <> pH
- volatile.acidity
- citric.acid <> volatile.acidity
- residual.sugar <> free.sulfur.dioxide, total.sulfur.dioxide, density (.84), alcohol
- chlorides <> none
- free.sulfur.dioxide <> residual.sugar, total.sulfur.dioxide (.64), density, alcohol
- total.sulfur.dioxide <> residual.sugar, free.sulfur.dioxide (.62), density, alcohol
- density <> volatile.acidity, residual.sugar, free.sulfur.dioxide, total.sulfur.dioxide, alcohol
- pH <> volatile.acidity
- sulphates <> none
- alcohol <> residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density (.78)
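One way to screen for such pairs is to compute Pearson's r for every pair of features and flag those at or above a chosen cutoff. The sketch below is a minimal Python illustration, not the original R workflow. It uses only the first ten rows of three features as shown in the str() output later in this report, so its values will differ from the full-dataset correlations listed above. The 0.30 cutoff (the lower bound of "moderate" on the scale quoted in the References) and the helper names pearson_r and correlated_pairs are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt

# First ten rows of three features, copied from the str() output in this report.
columns = {
    "residual.sugar": [20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5],
    "free.sulfur.dioxide": [45, 14, 30, 47, 47, 30, 30, 45, 14, 28],
    "alcohol": [8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11],
}

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlated_pairs(data, cutoff=0.30):
    """Return (feature_a, feature_b, r) for every pair with |r| >= cutoff."""
    flagged = []
    for a, b in combinations(data, 2):
        r = pearson_r(data[a], data[b])
        if abs(r) >= cutoff:
            flagged.append((a, b, r))
    return flagged

for a, b, r in correlated_pairs(columns):
    print(f"{a} <> {b}: r = {r:.2f}")
```

Any pair the sketch flags is a candidate for keeping only one of the two features in the model.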

Choices for Linear Regression

  • (+) sulphates
  • (+) alcohol
  • (-) volatile.acidity

Linear Model Results

Coefficients

 (Intercept)         sulphates           alcohol  volatile.acidity  
      2.8030            0.4157            0.3250           -1.9629 
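Read as an equation, these coefficients give quality ≈ 2.8030 + 0.4157 × sulphates + 0.3250 × alcohol − 1.9629 × volatile.acidity. The following Python sketch (the analysis itself was run in R) applies that equation to the first wine shown in the str() output further down, using the six-decimal estimates from the model summary. The function name predict_quality is hypothetical. The residual is taken as predicted minus actual, an assumption inferred from the signs in the dataset's residuals column.

```python
# Coefficients copied from the model summary (six-decimal estimates).
INTERCEPT = 2.802988
COEF = {
    "sulphates": 0.415712,
    "alcohol": 0.325028,
    "volatile.acidity": -1.962858,
}

def predict_quality(row):
    """Linear prediction: intercept plus the weighted sum of the three features."""
    return INTERCEPT + sum(COEF[name] * row[name] for name in COEF)

# First wine in the dataset, as shown in the str() output.
first_wine = {"sulphates": 0.45, "alcohol": 8.8, "volatile.acidity": 0.27}
actual_quality = 6

predicted = predict_quality(first_wine)
residual = predicted - actual_quality  # assumed sign convention

print(f"predicted = {predicted:.2f}, residual = {residual:.2f}")
```

For this wine the sketch gives about 5.32 and -0.68, in line with the first entries of the quality_predicated and residuals columns.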

Summary of the Model

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3186 -0.4854 -0.0406  0.4914  3.1555 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)       2.802988   0.109706  25.550  < 2e-16 ***
sulphates         0.415712   0.096580   4.304 1.71e-05 ***
alcohol           0.325028   0.008972  36.229  < 2e-16 ***
volatile.acidity -1.962858   0.109589 -17.911  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7707 on 4894 degrees of freedom
Multiple R-squared:  0.2431,    Adjusted R-squared:  0.2426 
F-statistic: 523.9 on 3 and 4894 DF,  p-value: < 2.2e-16
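Several of the figures in this summary can be reproduced from the others, which is a useful sanity check when reading it. The Python sketch below is an illustration under the usual ordinary least squares definitions, not part of the original R session: it recomputes the t values as estimate divided by standard error, the adjusted R-squared, and the F-statistic. Because it starts from the rounded numbers printed above, the results should agree with the summary only up to rounding.

```python
# Values copied from the model summary above.
estimates = {
    "(Intercept)": (2.802988, 0.109706),
    "sulphates": (0.415712, 0.096580),
    "alcohol": (0.325028, 0.008972),
    "volatile.acidity": (-1.962858, 0.109589),
}
r_squared = 0.2431
n_predictors = 3
residual_df = 4894                      # n - p - 1
n_obs = residual_df + n_predictors + 1  # 4898 wines

# t value = estimate / standard error
t_values = {name: est / se for name, (est, se) in estimates.items()}

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / residual_df

# F-statistic for the overall regression.
f_stat = (r_squared / n_predictors) / ((1 - r_squared) / residual_df)

for name, t in t_values.items():
    print(f"{name:<17} t = {t:8.3f}")
print(f"adjusted R-squared = {adj_r_squared:.4f}")
print(f"F = {f_stat:.1f}")
```

An R-squared of 0.2431 means the three features together explain roughly a quarter of the variance in quality, so the large t values indicate reliable effects rather than a precise predictor.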

Residual Plots

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -3.156000 -0.491700  0.040260 -0.000303  0.485100  3.318000

Q-Q Plot of Residuals

The deviations from the straight line are minimal, which indicates that the residuals are approximately normally distributed. There is some variation around the tails.
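A Q-Q plot pairs the sorted residuals with the quantiles a normal distribution would produce at the same ranks; points close to a straight line suggest normality. The Python sketch below builds those pairs by hand as an illustration. It uses only the first ten wines from the str() output and the model coefficients, whereas the plot in this report is based on all 4898 residuals. The plotting positions (i - 0.5) / n are one common choice and an assumption here, as is the predicted-minus-actual sign convention.

```python
from statistics import NormalDist, mean, stdev

# Coefficients from the model summary; rows are the first ten wines in the
# str() output as (sulphates, alcohol, volatile.acidity). All ten have quality 6.
INTERCEPT, B_SULPH, B_ALC, B_VA = 2.802988, 0.415712, 0.325028, -1.962858
rows = [
    (0.45, 8.8, 0.27), (0.49, 9.5, 0.30), (0.44, 10.1, 0.28),
    (0.40, 9.9, 0.23), (0.40, 9.9, 0.23), (0.44, 10.1, 0.28),
    (0.47, 9.6, 0.32), (0.45, 8.8, 0.27), (0.49, 9.5, 0.30),
    (0.45, 11.0, 0.22),
]
quality = 6

# Residuals as predicted minus actual (assumed sign convention), sorted.
residuals = sorted(
    INTERCEPT + B_SULPH * s + B_ALC * a + B_VA * v - quality
    for s, a, v in rows
)

# Theoretical quantiles of a normal with the sample's mean and spread,
# evaluated at plotting positions (i - 0.5) / n.
n = len(residuals)
normal = NormalDist(mean(residuals), stdev(residuals))
theoretical = [normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Each pair is one point on the Q-Q plot; normal data fall near y = x.
for t, r in zip(theoretical, residuals):
    print(f"theoretical {t:7.3f}   sample {r:7.3f}")
```

With so few rows the pairs are only a demonstration of the construction; the tails of the full plot are where departures from the line are most visible.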

Univariate Plots Section

The goal here is to focus on a set of defined features:


Univariate Analysis

What is the structure of your dataset?

## 'data.frame':    4898 obs. of  15 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ quality_predicated  : num  5.32 5.51 5.72 5.74 5.74 ...
##  $ residuals           : num  -0.68 -0.495 -0.281 -0.265 -0.265 ...

What is/are the main feature(s) of interest in your dataset?

  • Quality (score between 0 and 10)
  • Sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
  • Alcohol: the percent alcohol content of the wine
  • Volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste

SO2 plays a very important role in preventing oxidation and maintaining a wine’s freshness.

Sweet wines get the biggest doses because sugar combines with and binds a high proportion of any SO2 added. To get the same level of free sulphur dioxide, the total concentration has to be higher than for dry wines. http://www.morethanorganic.com/sulphur-in-the-bottle

Reference: text document from the dataset

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

  • citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
  • total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
  • residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
  • free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

Did you create any new variables from existing variables in the dataset?

Yes, new variables have been added to the dataset:

  • predicted values for quality
  • residuals
  • Let’s create a ratio with a couple of variables … has to be meaningful

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes. I would convert quality to a factor for plotting and then remove the factor again. A few plots work better when quality is a factor instead of an integer.

Bivariate Plots Section

Plots with Negative Slopes

The Residual Sugar plot was a surprise. It seems that the sweeter the wine (the more residual sugar it contains), the lower its quality rating tends to be.

Plots with Positive Slopes

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Some features had negative slopes and others had positive slopes when plotted against quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes/No

What was the strongest relationship you found?

Alcohol and Quality.

Multivariate Plots Section

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Features that strengthen each other… are we discussing highly correlated values, where a rise in one feature accompanies a rise in the correlated feature? Or are we trying to convey something else here?

Were there any interesting or surprising interactions between features?

The plots are a mess right now.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes. I created a linear model.

Results for Linear Regression

Results listed above.


Final Plots and Summary

Plot One

Description One

Plot Two

Description Two

Plot Three

Description Three


Reflection

The dataset contains only a small number of wines rated as low quality or as high quality.

For instance, the citric acid and residual sugar levels are more important in white wine.

Moreover, the volatile acidity has a negative impact, since acetic acid is the key ingredient in vinegar. The most intriguing result is the high importance of sulphates, ranked first for both cases. Oenologically this result could be very interesting. An increase in sulphates might be related to the fermenting nutrition, which is very important to improve the wine aroma.

References

MIT Results

  • sulphates: 23%
  • alcohol: 14%
  • residual sugar: 13%
  • citric acid: 10%
  • total sulfur dioxide: 9%
  • free sulfur dioxide: 8.5%
  • volatile acidity: 8%
  • density: 7%
  • pH: 6%
  • chlorides: 3%
  • fixed acidity: 2%

Pearson’s r-correlation

If r =

  • +.70 or higher: Very strong positive relationship
  • +.40 to +.69: Strong positive relationship
  • +.30 to +.39: Moderate positive relationship
  • +.20 to +.29: Weak positive relationship
  • +.01 to +.19: No or negligible relationship
  • -.01 to -.19: No or negligible relationship
  • -.20 to -.29: Weak negative relationship
  • -.30 to -.39: Moderate negative relationship
  • -.40 to -.69: Strong negative relationship
  • -.70 or lower: Very strong negative relationship

Reference: http://faculty.quinnipiac.edu/libarts/polsci/Statistics.html
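The strength bands listed above can be applied mechanically. The following Python sketch is one possible encoding, with a hypothetical function name describe_r. It treats the bands as contiguous (for example, .395 falls in "Moderate"), which is an assumption, since the scale as quoted leaves small gaps between bands. The two example values are the coefficients recorded in the correlation list earlier in this report, taken as positive because that is how they are written there.

```python
def describe_r(r):
    """Label a Pearson r using the strength bands listed above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    size = abs(r)
    if size >= 0.70:
        strength = "Very strong"
    elif size >= 0.40:
        strength = "Strong"
    elif size >= 0.30:
        strength = "Moderate"
    elif size >= 0.20:
        strength = "Weak"
    else:
        return "No or negligible relationship"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} relationship"

# Correlations quoted in the correlation list earlier in this report.
for pair, r in [("residual.sugar <> density", 0.84),
                ("free.sulfur.dioxide <> total.sulfur.dioxide", 0.64)]:
    print(f"{pair}: {describe_r(r)}")
```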